Minimax Regret Bounds for Reinforcement Learning
Abstract
We consider the problem of provably optimal exploration in reinforcement learning for finite-horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of Õ(√(HSAT) + H²S²A + H√T), where H is the time horizon, S the number of states, A the number of actions and T the number of timesteps. This result improves over the best previously known bound Õ(HS√(AT)) achieved by the UCRL2 algorithm of Jaksch et al. (2010). The key significance of our new results is that when T ≥ H³S³A and SA ≥ H, the bound reduces to Õ(√(HSAT)), which matches the established lower bound of Ω(√(HSAT)) up to a logarithmic factor. Our analysis contains two key insights. First, we apply concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in S). Second, we define Bernstein-based "exploration bonuses" that use the empirical variance of the estimated values at the next states (to improve scaling in H).
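The abstract describes the algorithm only at a high level; below is a minimal sketch, in our own notation, of optimistic backward value iteration with a Bernstein-style bonus built from the empirical variance of the estimated next-state values. The exact constants, clipping, and bonus form in the paper differ; counts, trans_counts, rewards, delta and t are assumed inputs maintained by the learner between episodes.

```python
import numpy as np

def ucbvi_plan(counts, trans_counts, rewards, H, delta, t):
    """One planning pass: optimistic value iteration with Bernstein-style
    exploration bonuses (a simplified sketch, not the paper's exact form).

    counts       : (S, A) visit counts N(s, a)
    trans_counts : (S, A, S) observed transition counts
    rewards      : (S, A) known (or estimated) mean rewards in [0, 1]
    """
    S, A = counts.shape
    L = np.log(5 * S * A * max(t, 2) / delta)      # logarithmic factor (assumed form)
    N = np.maximum(counts, 1)                      # floor counts at 1 to avoid division by zero
    P_hat = trans_counts / N[:, :, None]           # empirical transition estimates
    V = np.zeros(S)                                # terminal value V_H = 0
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        EV = P_hat @ V                             # empirical E[V_{h+1} | s, a]
        var = np.maximum(P_hat @ V**2 - EV**2, 0)  # empirical Var[V_{h+1} | s, a]
        # Bernstein-style bonus: the variance term replaces the worst-case
        # H * sqrt(1/N) Hoeffding bonus, which is what improves the H scaling.
        bonus = np.sqrt(2 * L * var / N) + 7 * H * L / (3 * N)
        Q[h] = np.minimum(rewards + EV + bonus, H) # optimistic values, clipped at H
        V = Q[h].max(axis=1)
    return Q
```

At each episode the learner would act greedily with respect to the returned Q, then fold the observed transitions back into counts and trans_counts before the next planning pass.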
Similar resources
What Doubling Tricks Can and Can't Do for Multi-Armed Bandits
An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon T of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the "Doubling Trick", sketched below. In the context of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences...
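Since the snippet above only names the technique, here is a minimal sketch of the Doubling Trick it refers to: repeatedly restart a horizon-tuned algorithm on geometrically growing horizon guesses. The make_algo / env.pull interfaces and the T0 default are our own assumptions, not from the paper.

```python
# Minimal sketch of the Doubling Trick (assumed interfaces: make_algo(T)
# returns a bandit algorithm tuned for horizon T with .choose()/.update();
# env.pull(arm) returns a reward).

def doubling_trick(make_algo, env, total_steps, T0=100):
    t, i = 0, 0
    while t < total_steps:
        horizon = T0 * 2**i            # current guess of the unknown horizon
        algo = make_algo(horizon)      # fresh instance tuned for this guess
        for _ in range(min(horizon, total_steps - t)):
            arm = algo.choose()
            reward = env.pull(arm)
            algo.update(arm, reward)
            t += 1
        i += 1                         # guess exhausted: double it and restart
```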
Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs
The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making the local variance of the bias function appear in place of th...
Optimal Non-Asymptotic Lower Bound on the Minimax Regret of Learning with Expert Advice
We prove non-asymptotic lower bounds on the expectation of the maximum of d independent Gaussian variables and the expectation of the maximum of d independent symmetric random walks. Both lower bounds recover the optimal leading constant in the limit. A simple application of the lower bound for random walks is an (asymptotically optimal) non-asymptotic lower bound on the minimax regret of onlin...
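For reference, the classical limit behind "the optimal leading constant" for Gaussians is the standard fact (our addition, not stated in the snippet above) that for X₁, …, X_d i.i.d. standard normal,

$$\mathbb{E}\Big[\max_{1 \le i \le d} X_i\Big] = \sqrt{2 \ln d}\,\big(1 + o(1)\big) \quad \text{as } d \to \infty,$$

so a non-asymptotic lower bound that "recovers the optimal leading constant in the limit" must approach √(2 ln d) for large d.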
Partial Monitoring - Classification, Regret Bounds, and Algorithms
In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed acti...
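In standard notation (ours, not necessarily the paper's), with a_t the learner's action, o_t the environment's outcome and ℓ the loss function, the regret just described is

$$R_T \;=\; \sum_{t=1}^{T} \ell(a_t, o_t) \;-\; \min_{a} \sum_{t=1}^{T} \ell(a, o_t).$$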
Efficient Policy Learning
We consider the problem of using observational data to learn treatment assignment policies that satisfy certain constraints specified by a practitioner, such as budget, fairness, or functional form constraints. This problem has previously been studied in economics, statistics, and computer science, and several regret-consistent methods have been proposed. However, several key analytical compone...
Publication date: 2017